Vivekananda_session3_4VP20CS062_Prajwal #9
base: main
# This function will add the entry to database
sql = """INSERT INTO members_blog (title, release_date, blog_time, author, created_date, content, recommended, html) VALUES (%s, %s::DATE, %s::TIME, %s, NOW(), %s, %s, %s)"""

with conn:
@Prajwal7Amin Why is `with` used here?
@anushds The `with` statement ensures that the connection is closed. Here it is used to manage the database connection: it keeps the connection open while the code block executes and automatically closes it when the block is exited.
@Prajwal7Amin In your code I see that you've started a transaction (insert/truncate), but I don't see a `commit` anywhere for it.
Is the data actually stored in the database? If yes, how is it stored?
@anushds Yes, the data is stored in the database. It is stored in the 'members_blog' table, and each row contains the data for the defined columns, such as title, date, author, content, blog time, etc.
@Prajwal7Amin Well, you've answered my question only partially. Could you talk about the `commit` that I asked about?
@anushds Yes, I understand it now. In my code I have not explicitly called the `commit` method on the `conn` object (`conn.commit()`), because the `with conn:` block handles the transaction and automatically commits the changes when it exits. This behavior is specified by the psycopg2 library.
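The commit-on-exit behaviour described here can be checked with a small experiment. The sketch below uses the standard-library `sqlite3` module as a stand-in for psycopg2 (an assumption, chosen so that it runs without a PostgreSQL server); its connection object is also a context manager that commits when the block exits normally. The cut-down table layout and the inserted values are hypothetical. Note that in both libraries, per their documentation, leaving the `with` block ends the transaction but does not close the connection.

```python
import os
import sqlite3
import tempfile

# A throwaway database file, so a second connection can verify what was committed.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE members_blog (title TEXT, author TEXT)")
conn.commit()

# No explicit conn.commit() inside the block: the context manager commits on exit.
with conn:
    conn.execute(
        "INSERT INTO members_blog (title, author) VALUES (?, ?)",
        ("Hello", "someone"),
    )

# A separate connection only sees committed data.
other = sqlite3.connect(db_path)
committed_rows = other.execute("SELECT title, author FROM members_blog").fetchall()
other.close()

# The original connection is still usable: the with block did not close it.
still_open_count = conn.execute("SELECT COUNT(*) FROM members_blog").fetchone()[0]
conn.close()  # closing remains the caller's job

print(committed_rows)
print(still_open_count)
```

If the connection should also be closed automatically, one option is to wrap it in `contextlib.closing(...)` in addition to the transaction block.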
@Prajwal7Amin Good. The reason for this behaviour is that psycopg2 objects are context managers. When an error occurs while executing a query, the transaction is automatically rolled back by the context manager. If you're wondering why rolling back a transaction is important, try experimenting with it: enclose the `execute` line in a try-except block and make sure the query breaks, then continue to execute the next query. Make sure to print the exception, otherwise you won't know what's going on.
Also, I want you to look into turning an object of a class into a context manager.
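Both suggestions can be tried together in one small sketch. The `Transaction` class below is a hypothetical helper, not part of psycopg2: it shows how defining `__enter__` and `__exit__` turns an object of a class into a context manager that commits on success and rolls back on error. It uses the standard-library `sqlite3` module (an assumption, so that it runs without a database server), and the second `INSERT` targets a table that does not exist so that the query breaks on purpose.

```python
import sqlite3


class Transaction:
    """Hypothetical helper: commit on success, roll back on error."""

    def __init__(self, conn):
        self.conn = conn

    def __enter__(self):
        # The value returned here is bound to the name after "as".
        return self.conn.cursor()

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is None:
            self.conn.commit()
        else:
            self.conn.rollback()
        # Returning False lets the exception propagate to the caller.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members_blog (title TEXT NOT NULL)")
conn.commit()

try:
    with Transaction(conn) as cur:
        cur.execute("INSERT INTO members_blog (title) VALUES (?)", ("first",))
        # This statement breaks on purpose: the table does not exist.
        cur.execute("INSERT INTO no_such_table (title) VALUES (?)", ("x",))
except sqlite3.Error as exc:
    # Print the exception, otherwise the failure goes unnoticed.
    print(f"query failed: {exc}")

# The next query still runs, and the partial insert above was rolled back.
with Transaction(conn) as cur:
    cur.execute("INSERT INTO members_blog (title) VALUES (?)", ("second",))

titles = [row[0] for row in conn.execute("SELECT title FROM members_blog")]
print(titles)
conn.close()
```

With PostgreSQL through psycopg2 the rollback matters even more: a failed statement leaves the transaction in an aborted state, and later statements on that connection fail until a rollback is issued.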
for con in contents:'''
content = post.select('.post-body')[0].text
html = 'S:\web_scapping\python_blogs.html'
with open(html, 'w', encoding='utf-8') as f:
@Prajwal7Amin You're writing to a file inside the loop. Do you think your file will contain the whole html?
@anushds Yes, it contains the whole HTML.
@Prajwal7Amin In the `open` call, the first argument is the file name. What does the second argument indicate, and why is it used here?
@anushds The second argument indicates the mode in which the file is opened. In this case `'w'` is write mode.
It is used here to write the new data into the file (html = 'S:\web_scraping\python_blogs.html').
@Prajwal7Amin Alright, so you've used write mode here.
Do you know what other modes are available? Could you briefly describe them?
@anushds 'r' -> Read mode. Opens the file for reading and is the default mode. Raises an error if the file doesn't exist.
'w' -> Write mode. Opens the file for writing. If the file already exists, it truncates its contents. If the file doesn't exist, it creates a new file.
'x' -> Exclusive creation mode. Opens the file for writing only if it doesn't exist. If the file exists, the operation fails with an error.
'a' -> Append mode. Opens the file for writing and appends data to the end of it without truncating it. If the file doesn't exist, it creates a new file.
'b' -> Binary mode. Used together with other modes like 'r' or 'w' to handle binary files. It is commonly used for reading or writing non-text files like images or audio files.
't' -> Text mode. This is the default and is used in conjunction with other modes like 'r' or 'w' to handle text files. File contents are read and written as strings.
'+' -> Update mode. Opens a file for updating (reading and writing). Used together with other modes, such as 'r+', 'w+', or 'a+'.
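The mode descriptions above can be verified with a short script. This is a minimal sketch that works on a temporary file; the file name `modes_demo.txt` and the written strings are made up for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# 'r' on a missing file raises FileNotFoundError.
try:
    open(path, "r")
    read_missing = "opened"
except FileNotFoundError:
    read_missing = "FileNotFoundError"

# 'x' creates the file, and fails if it already exists.
with open(path, "x", encoding="utf-8") as f:
    f.write("one\n")
try:
    open(path, "x")
    create_again = "opened"
except FileExistsError:
    create_again = "FileExistsError"

# 'a' keeps the existing contents and writes at the end.
with open(path, "a", encoding="utf-8") as f:
    f.write("two\n")
with open(path, "r", encoding="utf-8") as f:
    after_append = f.read()

# 'w' truncates first, so only the new data remains.
with open(path, "w", encoding="utf-8") as f:
    f.write("three\n")
with open(path, "r", encoding="utf-8") as f:
    after_write = f.read()

# Adding 'b' returns bytes instead of str.
with open(path, "rb") as f:
    raw = f.read()

print(read_missing, create_again)
print(repr(after_append), repr(after_write), type(raw).__name__)
```

The `'w'` step is the one relevant to the code under review: every time the file is reopened in write mode, whatever was written before is discarded.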
@Prajwal7Amin Considering the explanation you gave for write mode: you're writing to a file inside the loop. So let me ask you the same question again, does your file contain the whole HTML?
@anushds Thank you for noticing. Yes, you were right, the file does not contain the whole HTML. The 'python_blogs.html' file contains only the HTML content from the last iteration, that is, the final webpage that was scraped.
@anushds Should I use append mode 'a' here instead of write mode?
Is that right? Could you help me with this?
@Prajwal7Amin Good that you realised.
There are multiple approaches to this:
- Store the HTML contents of each iteration in a variable (say `html_contents`). After the loop ends, write this into the file.
- Yes, you can open the file in append mode and then write to it. But I wouldn't suggest doing that, because opening and closing the file on every iteration is an expensive operation (given the number of calls made to the OS). Sure, the difference will be in seconds, but even these "extra" seconds consumed by your code matter from a production perspective.
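The first approach can be sketched as follows. The `pages` list is a hypothetical stand-in for the HTML scraped in each iteration, and the output goes to a temporary directory rather than the path used in the PR. A variation not mentioned in the review, opening the file once outside the loop, is included as a further option, and the original write-inside-the-loop pattern is shown last for contrast.

```python
import os
import tempfile

# Hypothetical stand-ins for the HTML scraped in each iteration.
pages = ["<p>post 1</p>", "<p>post 2</p>", "<p>post 3</p>"]
out_dir = tempfile.mkdtemp()

# Approach 1: collect everything in memory, then write once after the loop.
html_contents = []
for page in pages:
    html_contents.append(page)
collected_path = os.path.join(out_dir, "collected.html")
with open(collected_path, "w", encoding="utf-8") as f:
    f.write("\n".join(html_contents))

# Variation: open the file once, outside the loop, and write on each iteration.
streamed_path = os.path.join(out_dir, "streamed.html")
with open(streamed_path, "w", encoding="utf-8") as f:
    for page in pages:
        f.write(page + "\n")

# For contrast: reopening in 'w' mode inside the loop keeps only the last page.
overwritten_path = os.path.join(out_dir, "overwritten.html")
for page in pages:
    with open(overwritten_path, "w", encoding="utf-8") as f:
        f.write(page)


def read_text(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


collected = read_text(collected_path)
streamed = read_text(streamed_path)
overwritten = read_text(overwritten_path)
print(collected)
print(overwritten)
```

Collecting in memory is simple when the scraped pages are small; writing through a single open handle avoids holding everything in memory while still opening the file only once.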